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[0005] 

[Problems to be Solved] 

When calculation is to be made in parallel using a plurality of arithmetic 
processing units, calculation can rarely be proceeded with totally independently 
without performing communication between the arithmetic processing units, and 
calculations are usually made while communication between the arithmetic processing 
units are being performed. For example, suppose that a matrix C of four rows by four 
columns is acquired by performing multiplication of matrixes A and B in four rows by 
four columns using four arithmetic processing units. Each element of A, B, and C is 
noted with a u , b Id , and c u as follows: 
[0006] 
[Equation 6] 



b^ b 22 b 23 f> 2 , 
b n b n A 33 f> 3 , 



At this time, in one of the four arithmetic calculation units, calculation is preformed as 

follows: 

[0007] 

[Equation 7] 



a n a n a 13 a 14 
a„ a M a, 4 



c t . c,. 



[0008] 

As obvious from this example, data on elements of all the rows or all the 
columns are required for the side used for the calculation (a Id or by). Also, for c u 
acquired as a result of the calculation, data is obtained only partially in each of the 
arithmetic processing units. This means the fact that if multiplication of the matrix C 
and the matrix A is needed in the subsequent step, for example, data does not become 
sufficient only with the elements obtained by the calculation. Therefore, the 
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remaining portion, that is, at least data of the first row and the second row of the 
matrix C and data of the first column and the second column in the above equation 
should be in the satisfied state after the calculation of A x B = C is made. 
[0009] 

If these problems are generalized, suppose that there is an array X(nk) made 
up of (n x k) pieces, they are divided into n units of arithmetic processing units, and an 
arithmetic processing unit with an identification number 1 has calculation results of 
X(1), X(2), ■■■ X(k), while an arithmetic processing unit with an identification number 2 
has calculation results of X(k + 1), X(k + 2), X(2k). By performing communication 
among n units of the arithmetic processing units in this state, an operation to create a 
state in which n units of the arithmetic processing units have calculation results of 
the array X(nk). 
[0010] 

One-to-one communication is a condition on communicating means at this 
time. That is, if data is to be transferred from an arithmetic processing unit 1 to an 
arithmetic processing unit 2, for example, the arithmetic processing unit 2 needs to be 
brought in a state to receive the data from the arithmetic processing unit 1, and if the 
arithmetic processing unit 2 is to transfer data to another processing such as an 
arithmetic processing unit 3 or to receive data from an arithmetic processing unit 4, 
for example, communication fails and the calculation is interrupted. In order to 
perform the communication without delay, an order of communication should be 
determined in advance in order to avoid confusion on the transmission side and the 
receiving side. 
[0011] 

Using a case in which four arithmetic processing units are used as an example, 
a method that can be conceived easily is as follows: For simplification of notation, 
the arithmetic processing units 1, 2, 3, and 4 are noted as #1, #2, #3, and #4, 
respectively. 

(1) A method of sequentially performing transmission - reception one by one [1] 
calculation result of #1 -> #2, [2] calculation result of #1 -> #3, [3] calculation result 
of #1 -> #4, [4] calculation result of #2 -> #1, [5] calculation result of #2 -> #3, [6] 
calculation result of #2 -> #4, [7] calculation result of #3 -> #1, [8] calculation result 
of #3 -> #2, [9] calculation result of #3 -> #4, [10] calculation result of #4 -> #1 , [1 1] 
calculation result of #4 -> #2, [12] calculation result of #4 -> #3 are sequentially 
performed. 
[0012] 
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Here, [1], [2], [3] ■■■ indicate numbers of processing steps. Supposing that 
there are N units of the arithmetic processing units and a data amount allocated to 
one unit is w, the number of communication times is 2 x N C 2 = N (N - 1), and a data 
movement amount is 2w x N C 2 = wN (N - 1). In the case of N = 4, the number of 
communication times is 12 times as above. According to this method, it takes time, 
but confusion in the communication can be avoided. p C q indicates the number of 
combinations to select q pieces of elements from p pieces of elements. 
[0013] 

(2) After data is collected in the representative arithmetic processing unit, it is 
distributed to each arithmetic processing unit. 

- Data collection [1] calculation result of #2 -> #1, [2] calculation result of #3 -> #1, 
[3] calculation result of #4 -> #1, ■■■ are sequentially performed. All the data is 
collected in #1. 

- Distribution of all the data [1] #1 -> #2, [2] #1 -> #3, [3] #1 -> #4, ■■■ are 
sequentially performed. The entire array is transmitted to #2, #3, and #4. 
[0014] 

The number of communication times in this case is 2(N - 1), and the data 
movement amount is (N - 1)w at the data collection and N(N - 1)w at the data 
distribution. With this method, the number of communication times is smaller than 
that in the method (1), but it has a demerit that the data amount transmitted at the 
distribution of all the data is large. 
[0015] 

Also, a method for promoting efficiency of communication is the following. 

(3) Communication is performed in parallel and exhaustively for one-to-one 
combinations of the arithmetic processing units. This is improvement of the method 
(1) and performed as follows, for example. 

[1] Simultaneous execution of calculation result of #1 -> #2, calculation result of #3 
-> #4 

[2] Simultaneous execution of calculation result of #1 -> #3, calculation result of #2 
-> #4 

[3] Simultaneous execution of calculation result of #1 -> #4, calculation result of #2 

-> #3 

[4] Simultaneous execution of calculation result of #2 -> #1, calculation result of #4 

-> #3 

[5] Simultaneous execution of calculation result of #3 -> #1, calculation result of #4 

-> #2 
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[6] simultaneous execution of calculation result of #4 -> #1 , calculation result of #3 -> 
#2 

[0016] 

According to this communication method, communication is not duplicated or 
collided but all the data is distributed to the four arithmetic processing units. If the 
number of arithmetic processing units is N, the number of communication times is 
2(N-1), and the data movement amount is 2(N-1)w. In the case of N = 4, time 
required for communication is a half of the method described in (1). The larger N 
becomes, the difference is widened. 
[001 7] 

(4) After data is collected in the representative arithmetic processing unit by Binary 
Tree of the arithmetic processing units, the data is distributed to each of the 
arithmetic processing units. This method is improvement of the method (2) and is 
executed as follows: 

- Data collection [1] calculation result of #2 -> #1 and calculation result of #4 -> #3 
are simultaneously executed. 

[2] Calculation result collected in #3 -> #1, distribution of all the data 
[3] #1 -> #3 

[4] Simultaneous execution of #1 -> #2 and #3 -> #4. 
[0018] 

According to this method, if the number of arithmetic processing units is N, 
the number of communication times is 2 x log 2 N, and the data communication amount 
is (N-1)w at the collection and Nw log 2 N at the distribution. In the case of N = 4, the 
number of communication time is 2/3 of the method (2), and the data movement 
amount is 1 1/15 of the method (2). The larger N becomes, the difference is widened. 
[001 9] 

Since the number of communication times is large, the method (3) is not 
suitable if an array to be handled is small though the data movement amount is small 
as compared with the method (4). Since the data movement amount is large, the 
method (4) is not suitable if an array to be handled is large though the number of 
communication times is small. 
[0020] 

Thus, a generalized method is needed in which the data movement amount 
and the number of communication times are both optimized and which can be applied 
under any condition. The present invention was made in view of these problems, and 
has an object to provide a parallel computer system and a communication method 
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between arithmetic processing units that can minimize the number of communication 
times between arithmetic processing units and waiting time at data transmission / 
reception and realize high-speed processing by enabling parallel data transmission / 
reception via communication. 
[0021] 

[Means for Solving the Problems] 

In order to achieve the above object, an invention described in claim 1 of the 
present invention is, in a parallel computer system provided with at least 2" units of 
arithmetic processing units having unique identifiers and individual storage devices 
and communication means corresponding to each of the arithmetic processing units, 
respectively, for performing data transmission / reception between each of the 
arithmetic processing units by the communication means, characterized in that when 
data arrays which were divided into 2" pieces of small arrays and distributed to the 2" 
units of the arithmetic processing units for calculation processing at each of the 
arithmetic processing units are to be collected in a single array again, identification 
numbers 0, 1, 2" -1 are given to 2" units of the arithmetic processing units, an 
arithmetic processing unit with an identification number N' obtained by reversing a 
numeral on the 2' digit of the identification number N noted in binary is made to 
correspond to the arithmetic processing unit with the identification number N, an 
operation i for mutually transmitting / receiving a calculation processing result of the 
data array between the arithmetic processing unit with the identification number N 
and the arithmetic processing unit with the identification number N' is sequentially 
executed from i = 0 to i = n - 1 , and for j > 0, the calculation processing result obtained 
till the operation (j - 1) is transmitted / received between the arithmetic processing 
units with the identification numbers N and N' in the operation j, in addition to the 
calculation processing result by each of the arithmetic processing units so that the 
data array is collected in the n times of operations between 2" units of the arithmetic 
processing units. 
[0022] 

Also, an invention described in claim 2 is, in a parallel computer system 
provided with (2 m + k) units of arithmetic processing units having unique identifiers and 
individual storage devices and communication means corresponding to each of the 
arithmetic processing units, respectively, for performing data transmission / reception 
between each of the arithmetic processing units by the communication means, 
characterized in that when data arrays which were divided into (2 m + k) pieces of small 
arrays and distributed to / calculated and processed at the (2 m + k) units of the 
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arithmetic processing units are to be collected in a single array again, an arithmetic 
processing unit group made up of 2 m+1 units in which (2 m - k) units of the arithmetic 
processing units provided with individual storage means and communication means 
are added to (2 m + k) units of the arithmetic processing units is formed, identification 
numbers 0, 1, 2 m+1 - 1 are given to 2 m+1 units of the arithmetic units constituting the 
arithmetic processing unit group, the arithmetic processing unit with an identification 
number N' obtained by reversing a numeral on the 2' digit of the identification number 
N noted in binary is made to correspond to the arithmetic processing unit with the 
identification number N, an operation i for mutually transmitting / receiving a 
calculation processing result of the data array between the arithmetic processing unit 
with the identification number N and the arithmetic processing unit with the 
identification number N' is sequentially executed from i = 0 to i = m, and for j > 0, the 
calculation processing result of the arithmetic processing unit with the identification 
number N to become N ^ 2 m + k and the calculation result obtained till the operation 
(j - 1) are transmitted from that arithmetic processing unit and the calculation 
processing result obtained till the operation (j — 1 ) is transmitted from the arithmetic 
processing unit with the identification number N to become N > 2 m + k in the operation 
j so that the data array is collected in the (m + 1) times of operations between (2 m + k) 
units of the arithmetic processing units. 
[0023] 

Also, an invention described in claim 3 is, in a parallel computer system 
provided with (2 m + k) units of arithmetic processing units having unique identifiers and 
individual storage devices and communication means corresponding to each of the 
arithmetic processing units, respectively, for performing data transmission / reception 
between each of the arithmetic processing units by the communication means, 
characterized in that when data arrays which were divided into (2 m + k) pieces of small 
arrays and distributed to / calculated and processed at the (2 m + k) units of the 
arithmetic processing units are to be collected in a single array again, the data array is 
extended to 2 m+1 pieces of small arrays by adding (2 m - k) pieces of empty small arrays 
to the (2 m + k) pieces of data arrays, an arithmetic processing unit group made up of 
2 m+1 units in which (2 m - k) units of the arithmetic processing units provided with 
individual storage means and communication means are added to (2 m + k) units of the 
arithmetic processing units is formed, identification numbers 0, 1, 2 m+1 - 1 are given 
to 2 m+1 units of the arithmetic processing units constituting the arithmetic processing 
unit group, the arithmetic processing unit with an identification number N' obtained by 
reversing a numeral on the 2' digit of the identification number N noted in binary is 
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made to correspond to the arithmetic processing unit with the identification number N, 
an operation i for mutually transmitting / receiving a calculation processing result of 
the data array between the arithmetic processing unit with the identification number N 
and the arithmetic processing unit with the identification number N' is sequentially 
executed from i = 0 to i = m, and for j > 0, the calculation processing result obtained till 
the operation Q - 1) is transmitted / received between the arithmetic processing units 
with the identification numbers N and N' in the operation j in addition to the 
calculation processing result by each of the arithmetic processing units so that the 
data array is collected in the (m + 1) times of operations in (2 m + k) units of the 
arithmetic processing units. 
[0024] 

Also, an invention described in claim 4 is, in a parallel computer system 
provided with (2 n + 2 m ) units of arithmetic processing units having unique identifiers, in 
which n and m are n > m, and individual storage devices and communication means 
corresponding to each of the arithmetic processing units, respectively, for performing 
data transmission / reception between each of the arithmetic processing units by the 
communication means, characterized in that when data arrays which were divided into 
(2" + 2 m ) pieces of small arrays and distributed to / calculated and processed at the 
(2" + 2 m ) units of the arithmetic processing units is to be collected in a single array 
again, the (2" + 2 m ) units of the arithmetic processing units are divided into a group G, 
made up of 2" units and a group G 2 made up of 2 m units, the data arrays are divided 
into two arrays, that is, an array A, made up of first 2 n pieces of small arrays and an 
array A 2 made up of the remaining 2 m pieces of small arrays, the arrays A, and A 2 are 
made to correspond to the groups G, and G 2 , respectively, and distribution and 
calculation processing is carried out, three processes are provided: a first process in 
which identification numbers 0, 1, 2" - 1 are given to 2" units of the arithmetic 
processing units of the group G,, the arithmetic processing unit with an identification 
number N' obtained by reversing a numeral on the 2' digit of the identification number 
N noted in binary is made to correspond to the arithmetic processing unit with the 
identification number N, an operation i for mutually transmitting / receiving a 
calculation processing result of the data array between the arithmetic processing unit 
with the identification number N and the arithmetic processing unit with the 
identification number N' is sequentially executed from i = 0 to i = n-1, and forj > 0, 
the calculation processing result obtained till the operation (j - 1) is transmitted / 
received between the arithmetic processing units with the identification numbers N 
and N' in the operation j in addition to the calculation processing result by each of the 
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arithmetic processing units so that the data array is collected in the group G,; a 
second process in which identification numbers 0, 1 , 2 m - 1 are given to 2 m units of 
the arithmetic units of the group G 2 , the arithmetic processing unit with an 
identification number N' obtained by reversing a numeral on the 2' digit of the 
identification number N noted in binary is made to correspond to the arithmetic 
processing unit with the identification number N, an operation i for mutually 
transmitting / receiving a calculation processing result of the data array between the 
arithmetic processing unit with the identification number N and the arithmetic 
processing unit with the identification number N' is sequentially executed from i = 0 
to i = n-1 , and for j > 0, the calculation processing result obtained till the operation (j - 
1) is transmitted / received between the arithmetic processing units with the 
identification numbers N and N' in the operation j in addition to the calculation 
processing result by each of the arithmetic processing units so that the data array is 
collected in the group G,; and a third process in which the array A, is transmitted from 
each of the arithmetic processing units of the group G, to those of the group G 2 , and 
the array A 2 from each of the arithmetic processing units of the group G 2 to those of 
the group G,, and the data array is collected in (2 n + 2 m ) units of the arithmetic 
processing units by executing the third process after the first and the second 
processes are executed in parallel. 
[0025] 

Also, an invention described in claim 5 is, in a parallel computer system 
provided with a plurality of arithmetic processing units having unique identifiers and 
individual storage devices and communication means corresponding to each of the 
arithmetic processing units, respectively, characterized in that 
when data arrays which were divided into: 
[Equation 8] 

+2 n * +2" 3 + — + 2"* I 

pieces of small arrays (n, >n 2 >n 3 > ■■■ >n k ^ 0) and distributed / calculated and 
processed at: 
[Equation 9] 

2". + 2"3+2" 3 +-" + 2"'l 

units of arithmetic processing units are to be collected again in a single array, among 
these arithmetic processing units, 
[Equation 10] 
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units are divided into k sets of groups as groups G,, G 2 , ■", G k , and in the small arrays, 
[0026] 

[Equation 1 1] 



pieces of the small arrays are divided into k pieces of arrays, respectively, as 
A, ,A 2 ,"\A k , these k pieces of arrays and k sets of the groups G,, G 2 , G k are made 
to contribute in a one-to-one manner for executing distribution and calculation 
processing, an in-group process p is executed, in which identification numbers 0, 1, ■■■ 
are given to (n p power of 2) units of the arithmetic processing units of the group G,, G 2 , 
■■■G k in the group G p for each p to become 1 ^ p ^ k, the arithmetic processing 
unit with an identification number N' obtained by reversing a numeral on the 2' digit of 
the identification number N noted in binary is made to correspond to the arithmetic 
processing unit with the identification number N, an operation i for mutually 
transmitting / receiving a calculation processing result of the data array between the 
arithmetic processing unit with the identification number N and the arithmetic 
processing unit with the identification number N' is sequentially executed from i = 0 
to i = n-1 , and for j > 0, the calculation processing result obtained till the operation Q - 
1) is transmitted / received between the arithmetic processing units with the 
identification numbers N and N' in the operation j in addition to the calculation 
processing result by each of the arithmetic processing units so that the data array A p 
is collected in each arithmetic processing unit in the group G p , and after the in-group 
process (k - 1) is finished, an inter-group process k is executed, in which a calculation 
result of the array A k is transmitted from the arithmetic processing units of the group 
G k to the arithmetic processing unit of the group G k _,, and then, an inter-group 
process p is executed in the descending order for p from p = k - 1 to p = 2, in which 
the calculation result of the array A p collected in each arithmetic processing unit of 
the group G p is transmitted from the arithmetic processing units of the group G p to 
each arithmetic processing unit belonging to the group G q to all the q to become q > p 
and the array A p , which is a calculation result of the group G p , and the calculation 
results of the arrays A pH , A k received from the arithmetic processing units of the 
group G p+1 are transmitted from the arithmetic processing units of the group G q to the 
arithmetic processing units of the group G pH so that the data array is collected in 




[0027] 
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[Equation 12] 

2" 1 +2" 2 +2" 3 + ■■• + 2"* 



units of the arithmetic processing units. 
[0028] 

At this time, k pieces of the in-group processes 1, 2, k are executed in 
parallel and when the in-group process s is finished, by sequentially executing the 
inter-group process (s + 1) for s to become 1 ^ s ^ k - 1, time required for the 
entire communication can be further reduced. 
[0029] 

Also, an invention described in claim 6 is characterized in that in data 
exchange between the groups of the arithmetic processing units using the parallel 
computer system described in claim 4 or 5, when a data array A collected and shared 
by a group G A made up of 2 P units of arithmetic processing units and a data array B 
collected and shared by a group G B made up of 2 q units of the arithmetic processing 
units (p > q) are mutually transmitted / received between the groups G A and G B , an 
operation in which each of 2" units of the arithmetic processing units selected from 
the group G A is made to correspond to each of the arithmetic processing units in the 
group G B in a one-to-one manner, and the data array A is transmitted to each 
arithmetic processing unit of the group G B is executed in parallel, while the group G A is 
divided into small groups a, ,Ci 2 ,--,a r made up of 2 p ~ q units of the arithmetic 
processing units (r=2 q ), respectively, and each small group is made to correspond to 
each of r units of the arithmetic processing units of the group G B in a one-to-one 
manner, and after the data array B is transmitted from the arithmetic processing unit 
of the group G B corresponding to the small group OL - t to the single arithmetic 
processing unit selected from the small group a the operation i to transmit / receive 
the data array B between the arithmetic processing units of the small group (X ■ is 
executed in parallel for i to become 1 S i S r so that the data array A and the data 
array B are shared by 2 P units of the arithmetic processing units and 2 q units of the 
arithmetic processing units. 
[0030] 

Also, an invention described in claim 7 is characterized in that the process for 
exchanging data between two arithmetic processing units using the parallel computer 
system described in any of claims 1 to 6 includes a first transmission process for 
transmission data from the arithmetic processing unit with a larger identification 
number to that with a smaller identification number and a second transmission 
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process for transmitting data from the arithmetic processing unit with a smaller 
identification number to that with a larger identification number, and one process 
selected from the first transmission process and the second transmission process is 
executed first and then, the other process is subsequently executed. 
[0031] 

[Embodiment of the Invention] 

Fig. 1 is a block diagram illustrating a configuration diagram of a parallel 
computer system. The parallel computer system shown here is provided with a 
single host computer 1 and 8 arithmetic processing units 2-1, 2-2, 2-8. The host 
computer is provided with a storage device 3 and communication means 4, while each 
of the arithmetic processing units 2-1, 2-2, 2-8 is provided with individual storage 
devices 5-1, 5-2, 5-8 and communication means 6-1, 6-2, 6-8. For example, 
input data or the like read by the host computer is transmitted to all the arithmetic 
processing units from the communication means 4 through the communication means 
6-1, 6-2, 6-8. Each of the arithmetic processing units 2-1, 2-2, 2-8 makes 
calculation of an assigned region, respectively, and sends / receives data via 
communication between the arithmetic processing units as necessary. 
[0032] 

On the basis of the configuration of the parallel computer system shown in 
Fig. 1, a first embodiment of the parallel computer system according to the present 
invention will be described. Fig. 2 is a chart illustrating a communication method 
between the arithmetic processing units of the parallel computer system in a time 
series in this embodiment. 
[0033] 

If identification numbers of the arithmetic processing units 2-1, 2-2, 2-8 
are 0, 1, 7, respectively, and they are noted in 3— digit numerals in binary, they are 
000, 001 , 01 0,01 1 , 1 00, 1 01 , 1 1 0, 1 1 1 , respectively. The array A made up of 8 x n 
pieces of data is divided into eight pieces of small arrays a,, a 2 , "\ a 8 made up of n 
pieces of data and assigned to 8 arithmetic processing units 2-1, 2-2, 2-8. After 
calculation processing is executed with regard to data of small arrays assigned by the 
respective arithmetic processing units, suppose that elements of the array A are 
collected in all the arithmetic processing units. Numerals 0 or 1 noted at each 
arithmetic processing unit in Fig. 2 indicates the divided small array, respectively, in 
which 0 indicates a state in which a calculation result has not been inputted yet, while 
1 indicates a state in which the calculation result has been already inputted. 
[0034] 
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As a first step, data is exchanged with the arithmetic processing unit having 
an identification number obtained by reversing a numeral on the 2° digit (1 for 0 or 0 
for 1). For example, the arithmetic processing unit 0 (000) exchanges n pieces of 
data with the arithmetic processing unit 1 (001), and the arithmetic processing unit 3 
(01 1) with the arithmetic processing unit 2 (010). 2n pieces of elements are 
collected in each arithmetic processing unit. 
[0035] 

At a second step, data is exchanged with the arithmetic processing unit 
having an identification number obtained by reversing the numeral on the 2 1 digit. 
For example, the arithmetic processing unit 0 (000) exchanges data with the 
arithmetic processing unit 2 (010), and the arithmetic processing unit 3 (01 1) with the 
arithmetic processing unit 1 (001). At this time, in the transmission from the 
arithmetic processing unit 0 to the arithmetic processing unit 2, for example, 2n 
pieces of data including data received from the arithmetic processing unit 1 in the first 
step is transmitted in addition to the calculation result of the arithmetic processing 
unit 0. As a result, 4n pieces of elements are collected in each of the arithmetic 
processing units. 
[0036] 

Lastly, as a third step, data is exchanged with the arithmetic processing unit 
having an identification number obtained by reversing a numeral on the 2 2 digit. For 
example, the arithmetic processing unit 0 (000) exchanges 4n pieces of data with the 
arithmetic processing unit 4 (100), and the arithmetic processing unit 3 (01 1) with the 
arithmetic processing unit 7(111). 8n pieces of elements are collected in each 
arithmetic processing unit, and the operation is completed. 
[0037] 

The above-described communication method is executed when the 
calculation is divided into 2 3 = 8 parts, and the number of steps at this time is 3. If 
the calculation is divided into 2 4 = 16 parts and executed by 16 arithmetic processing 
units similarly, another step is required in addition to the above-described 8-division 
case, which makes 4 steps in total. In general, if the calculation is divided into N 
parts and communication is made by N units of arithmetic processing units, the above 
method is used similarly, and the communication is completed in the number of steps 
log 2 N. 
[0038] 

Working effects of this embodiment will be verified below. For example, 
suppose that a size of an array is M (word), the number of units of arithmetic 
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processing units is K, and the entire array is divided into K parts and delivered to each 
of the arithmetic processing units. A value of K is supposed to be expressed as 
power of 2, which is the most generalized condition in parallel calculation, that is, K = 
2". Consider time required for creating a situation in which all the arithmetic 
processing units grasp data regarding the entire array via communication between the 
arithmetic processing units. In general, time T required for data transmission can be 

expressed as T=A+B X W (1). Here, A is time required for 

communication preparation and time required all the time for single communication 
regardless of a data amount to be transmitted. A value of A does not depend on the 
data amount. B x W is a term in proportion with the data amount, in which W is a data 
amount (number of WORDs) and b is transfer time per word. 
[0039] 

The number of steps for data transmission / reception is log 2 K =n. 
Transmission and reception is performed once each at each arithmetic processing 
unit at each step. The data amount transmitted / received at the m-th step is 
(M/K) X 2 m [word]. The number of transmission / reception times required for 
collection of data with the data amount M[word] at all the arithmetic processing units 
is 2n times per arithmetic processing unit, and the total data amount to be transmitted 
/ received is: 
[0040] 

[Equation 13] 



Thus, the total communication time T when the present invention is applied isT(K)= 



For comparison, all the data is collected in a single representative arithmetic 
processing unit by a prior-art method such as a Binary Tree method and 
communication time when data is similarly transmitted to all the arithmetic processing 
units by the Binary Tree method is acquired below. The number of transmission / 
reception times required for collection of all the data in a single arithmetic processing 
unit is n = log 2 K times in the representative arithmetic processing unit. Also, the 
data amount transmitted at the m-th step (m ^ n) is (M/K) X 2 m_1 [word]. Thus, 
timeT, required for collecting all the data in the representative arithmetic processing 




2A log 2 K+2M(1-1/K)B 
[0041] 



(2). 



unit is T, (K) = A log 2 K+M (1-1 /K)B 
[0042] 



(3). 
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The number of steps when data is distributed from the representative 
arithmetic processing unit to each arithmetic processing unit is log 2 K, and the number 
of communication times per arithmetic processing unit is log 2 K times at the maximum. 
However, M[word] data is transmitted at each step. Thus, T 2 required for distribution 
of data to each arithmetic processing unit is T 2 (K)= A log 2 K + (M log 2 K)B 

(4). Therefore, the total communication timeT 0 =T, +T 2 is T 0 (K)= 2A 

log 2 K+M0-1/K+ log 2 )B (5). 

[0043] 

Graphs in Figs. 3 and 4 show a relationship between increase in the number of 
arithmetic processing units and increase in communication time with the number of 
arithmetic processing units on the lateral axis and time required for communication on 
the vertical axis and show the relationship in the equation (5) by the prior-art Binary 
Tree communication method and the relationship in the equation (2) in which 
communication efficiency is improved by this embodiment in comparison. Reference 
numerals 10a and 10b indicated by a solid line in curves in the graphs show the case 
of the equation (2) in this embodiment, while reference numerals 11a and 11b 
indicated by a broken line show the prior art case of the equation (5). A curve with 
the reference numerals 10a and 11a shown in Fig. 3 assumes a situation in which a 
data amount to be communicated is small and A (communication rising time) in the 
equation (1) is substantially a half of the total communication time T, while a curve 
with the reference numerals 10b and 1 1b shown in Fig. 4 assumes a situation in which 
the data amount to be communicated is large and A (communication rising time) in the 
equation (1) is sufficiently smaller than the total communication time T. As obvious 
from these graphs, according to this embodiment, time required for communication 
can be reduced in either of cases that the number of arithmetic processing unit is 
small and large as compared with the prior-art method. That is, waiting time at data 
transmission / reception can be minimized, and high-speed calculation can be 
realized. 
[0044] 

In the parallel computer system made up of 16 arithmetic processing units, 
for example, it may be so configured that only arithmetic processing units in the 
number to become a power of 2 are extracted from a plurality of units, identification 
numbers for communication control are given to them, and divided distribution and 
calculation processing of arrays are executed according to the number of units such 
that divided distribution, calculation and collection of the array in the above-described 
three steps are executed among 8 arithmetic processing units among them. 
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[0045] 

In the above first embodiment, it is assumed that the number of related 
arithmetic processing units is a power of 2. A second embodiment detailed below is 
extended to a case in which the number of arithmetic processing units is not a power 
of 2 as a general condition, that is, a case in which the number of units is expressed as 
2" +k or the like. 
[0046] 

The second embodiment of the parallel computer system according to the 
present invention will be described. Here, a case in which an array of parallel 
calculation is divided into 6 parts and assigned to 6 arithmetic processing units 
(identification numbers are 0, 1, 5), for example. Fig. 5 is a chart illustrating a 
communication method between arithmetic processing units in the parallel computer 
system in this embodiment in a time series. For data processing at this time, 2 
arithmetic processing units (identification numbers are 6 and 7) are supposed to be 
used in addition to the above-described 6 arithmetic processing units. 
[0047] 

As a first step, data is exchanged with the arithmetic processing unit having 
an identification number obtained by reversing a numeral on the 2° digit. For example, 
the arithmetic processing unit 0 (000) exchanges n pieces of data with the arithmetic 
processing unit 1 (001), and the arithmetic processing unit 3 (01 1) with the arithmetic 
processing unit 2 (010). Since the arithmetic processing unit 6 (110) and the 
arithmetic processing unit 7 (1 1 1) do not have data to be exchanged, they are paused. 
At this time, 2n pieces of data are collected in the arithmetic processing units 0 to 5. 
[0048] 

At a second step, data is exchanged with the arithmetic processing unit 
having an identification number obtained by reversing the numeral on the 2 1 digit. 
For example, the arithmetic processing unit 4 (100) exchanges data with the 
arithmetic processing unit 6 (110), but since the arithmetic processing unit 6 (110) 
does not have data to be transmitted at this time, it only receives data from the 
arithmetic processing unit 4. Through this data exchange, 4n pieces of data are 
collected in the arithmetic processing units 0 to 3 and 2n pieces of data in the 
arithmetic processing units 4 to 7. 
[0049] 

At a third step, data is exchanged with the arithmetic processing unit having 
an identification number obtained by reversing a numeral on the 2 2 digit. For example, 
the arithmetic processing unit 6 (110) exchanges data with the arithmetic processing 
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unit 2 (010). The arithmetic processing unit 6 transmits 2n pieces of data to the 
arithmetic processing unit 2, and the arithmetic processing unit 2 transmits 4n pieces 
of data to the arithmetic processing unit 6. In this way, 6n pieces of data are 
delivered to all the 8 arithmetic processing units. 
[0050] 

In this embodiment, an arithmetic processing unit group of 2 n+1 units in which 
(2 n -k) units of the arithmetic processing units are added to (2 n + k) units of the 
arithmetic processing units is constituted in general, and the parallel calculation is 
performed at the steps detailed in the first embodiment in this arithmetic processing 
unit group. As a result, by configuring the arithmetic processing units in the number 
not a power of 2 based on the case of a power of 2, the working effects similar to the 
above first embodiment can be obtained. 
[0051] 

Subsequently, a third embodiment of a parallel computer system to the 
present invention will be described. In a communication method between the 
arithmetic processing units in this embodiment, a case in which an array is divided into 
6 small arrays and assigned to 6 arithmetic processing units (identification numbers 0, 
1, 5) will be described. First, the array is extended by 2 small arrays, and 0 is 
filled in the extended portion. In the case of an array made up of 12 elements 
(3,1,4,1,5,9,2,6,5,3,5,8), for example, an array (0,0,0,0) made up of 4 elements is added 
to have an array made up of 16 elements (3,1,4,1,5,9,2,6,5,3,5,8,0,0,0,0). 8 arithmetic 
processing units in which 2 arithmetic processing units (identification numbers are 6 
and 7) are added to the 6 arithmetic processing units are used. After that, 
communication is made among the 8 arithmetic processing units for data exchange 
according to the procedure detailed in the above first embodiment. 
[0052] 

In this embodiment, an arithmetic processing unit group of 2 n+1 units in which 
(2" -k) units of the arithmetic processing units are added to (2 n +k) units of the 
arithmetic processing units is constituted in general, and its array is extended to 2 n+1 
pieces and distributed to each of the arithmetic processing units, and parallel 
calculation and data collection are performed with a method similar to that in the 
above first embodiment. As a result, by configuring the arithmetic processing units 
in the number not a power of 2 based on the case of a power of 2, the working effects 
similar to the above first embodiment can be obtained. 
[0053] 
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Subsequently, a fourth embodiment of a parallel computer system according 
to the present invention will be described. In the second and third embodiments, a 
method for collecting data arrays in a single arithmetic processing unit by 2 n+1 units of 
arithmetic processing units if the number of array division is not a power of 2, that is, 
if the array is divided into (2 n +k) parts was described. On the other hand, in this 
embodiment, if the number of array division is 2" +2 m (n > m), processing is made by 
(2n + 2m) units of the arithmetic processing units. 
[0054] 

As a communication method between the arithmetic processing units of the 
parallel computer system in this embodiment, first, a case in which an array is divided 
into 6 parts and assigned to 6 arithmetic processing units (identification numbers are 
0, 1, 5) will be described as an example. Fig. 6 is a chart illustrating the 
communication method between the arithmetic processing units in this case in a time 
series. 
[0055] 

First, the 6 arithmetic processing units are divided into two groups. An 
arithmetic processing unit group 1 is constituted by 4 units with the identification 
numbers 0 to 3. An arithmetic processing unit group 2 is constituted by 2 units with 
the identification numbers 4 to 5. Subsequently, among the 4 units of the arithmetic 
processing unit group 1 and between the 2 units of the arithmetic processing unit 
group 2, data is collected in each group according to the procedure in the 
above-described second embodiment. The first and second steps in Fig. 6 
correspond to this. 
[0056] 

After that, data exchange is performed between the group 1 and the group 2 
according to the following procedure: 

[1] arithmetic processing unit 4 -> arithmetic processing unit 0, arithmetic processing 
unit 5 -> arithmetic processing unit 2 are simultaneously executed (corresponding to 
the third step in Fig. 6). 

[2] arithmetic processing unit 1 -> arithmetic processing unit 4, arithmetic processing 
unit 3 -> arithmetic processing unit 5 are simultaneously executed (corresponding to 
the fourth step in Fig. 6). 

[3] arithmetic processing unit 0 -> arithmetic processing unit 1, arithmetic processing 
unit 2 -> arithmetic processing unit 3 are simultaneously executed (corresponding to 
the fifth step in Fig. 6). 

By this method, data arrays can be collected by the six arithmetic processing units. 
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[0057] 

Also, another example of this embodiment, a case in which an array is divided 
into 10 parts and assigned to 10 arithmetic processing units (identification numbers 0, 
1, 9) will be described. Fig. 7 is a chart illustrating a communication method 
between arithmetic processing units in this case in a time series. 
[0058] 

First, the 10 arithmetic processing units are divided into two groups. An 
arithmetic processing unit group 1 is constituted by 8 units with the identification 
numbers 0, 1, 7. An arithmetic processing unit group 2 is constituted by 2 units 
with the identification numbers 8 and 9. Subsequently, among the 8 units of the 
arithmetic processing unit group 1 and between the 2 units of the arithmetic 
processing unit group 2, data is collected in each group according to the procedure in 
the above-described second embodiment. The first, second, and third steps in Fig. 7 
correspond to this. 
[0059] 

After that, data exchange is performed between the group 1 and the group 2 
according to the following procedure: 

[1 ' ] arithmetic processing unit 8 -> arithmetic processing unit 0, arithmetic 
processing unit 9 -> arithmetic processing unit 4 are simultaneously executed 
(corresponding to the fourth step in Fig. 7). 

[2'] arithmetic processing unit 3 -> arithmetic processing unit 8, arithmetic 
processing unit 7 -> arithmetic processing unit 9, processing unit 0 -> arithmetic 
processing unit 2, and processing unit 4 -> arithmetic processing unit 6 are 
simultaneously executed (corresponding to the fifth step in Fig. 7). 
[3'] arithmetic processing unit 0 -> arithmetic processing unit 1, arithmetic 
processing unit 4 -> arithmetic processing unit 5, arithmetic processing unit 2 -> 
arithmetic processing unit 3, and arithmetic processing unit 6 -> arithmetic 
processing unit 7 are simultaneously executed (corresponding to the sixth step in Fig. 
7). 

[0060] 

By this method, data arrays can be collected by the six arithmetic processing 
units. Distribution of data in the group 2, transmitted from the group 2 to the group 1 
is performed by the Binary Tree method. 
[0061] 

A fifth embodiment of a parallel computer system according to the present 
invention will be described below. A communication method between arithmetic 
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processing units in this embodiment is a method obtained by generalizing the 
communication method in the above fifth embodiment. A case in which an array is 
divided into 22 parts and assigned to 22 arithmetic processing units (identification 
numbers 0, 1, 21) will be described below as an example. Figs. 8 and 9 are charts 
illustrating a communication method between arithmetic processing units in a time 
series in the case of this 22 division of the array. Fig. 8 shows first to fourth steps, 
and Fig. 9 shows fifth to eighth steps. 
[0062] 

Since it is 22=2 4 +2 2 +2', the arithmetic processing units are divided into the 
following three groups: 

Group 1: arithmetic processing units (16 units) with the identification numbers 0,1, 
15; 

Group 2: arithmetic processing units (4 units) with the identification numbers 16, 17, 
18, 19; and 

Group 3: arithmetic processing units (2 units) with the identification numbers 20, 21. 
[0063] 

Subsequently, data is collected in each group among the 16 units of the 
arithmetic processing unit group 1, among 4 units of the arithmetic processing unit 
group 2, and between 2 units of the arithmetic processing unit group 3 according to 
the method in the above first embodiment. This corresponds to the first to fourth 
steps shown in Fig. 8. 
[0064] 

After that, data is exchanged between the groups according to a procedure 
similar to the method described in the above second or third embodiment. The data 
communication method will be sequentially described below. First, at a second step, 
data collection in the group 2 is finished, but since data collection of the group 3 has 
been already finished at that time, as the subsequent step, data is exchanged between 
the arithmetic processing units 16 and 18 of the group 2 and the arithmetic processing 
units 1 9 and 20 in the group 3, respectively. This corresponds to the third step in 
the group 2 and the group 3 shown in Fig. 8. At this time, all the data in the group 2 
and the group 3 has been stored in all the arithmetic processing units in the group 3. 
[0065] 

Subsequently, data received from the group 3 is transmitted from the 
arithmetic processing units 16 and 18 of the group 2 in which all the data of the group 
2 and the group 3 are stored to the arithmetic processing units 17 and 19 of the group 
2, respectively. This corresponds to the fourth step in the group 2 shown in Fig. 8. 



Application No. 10/561,872 
Atty. Docket No.: 1215.004 



20 



[0066] 

In the group 1, data collection is finished between each arithmetic processing 
unit at the fourth step, and as the subsequent step, data is transmitted / received 
between the group 1 and the groups 2, 3. First, data is transmitted from the 
arithmetic processing units 0, 1,2, 3, 4, 5 of the group 1 to the arithmetic processing 
units 16, 17, 18, 19, 20, 21 of the groups 2, 3. As a result, in the groups 2, 3, data 
collection of all the 22 arithmetic processing units in the groups 1, 2, 3 is completed. 
This corresponds to the fifth step shown in Fig. 9. 
[0067] 

Subsequently, 16 arithmetic processing units in the group 1 are divided into 
four small groups. That is, they are divided into a small group 1 of the group 1: the 
arithmetic processing units 0, 1, 2, 3, a small group 2: the arithmetic processing units 4, 
5, 6, 7, a small group 3: the arithmetic processing units 8, 9, 1 0, 1 1 , and a small group 4: 
the arithmetic processing units 12, 13, 14, 15. 
[0068] 

One unit each of the arithmetic processing unit is selected from each of the 
small groups. Here, suppose that the arithmetic processing units 0, 4, 8, 12 are 
selected. To these 4 arithmetic processing units, data collected for the group 2 and 
the group 3 from the arithmetic processing units 16, 17, 18, 19 of the group 2 is 
transmitted. This corresponds to the sixth step shown in Fig. 9. 
[0069] 

Subsequently, in each small group of the group 1, data is transmitted / 
received for the groups 2, 3 between the arithmetic processing units by the prior-art 
Binary Tree method so that data of the groups 1, 2, 3 is collected in all the arithmetic 
processing units of the small groups. For example, data is transmitted from the 
arithmetic processing unit 0 to the arithmetic processing unit 2 in the small group 1, 
and then, data is transmitted from the arithmetic processing units 0, 2 to the 
arithmetic processing units 1, 3, respectively. The same applies to the other small 
groups. This corresponds to the seventh step and the eighth step shown in Fig. 9. 
Then, collection of 22 data arrays is completed in all the 22 arithmetic processing 
units. 
[0070] 

In general, data arrays in the arithmetic processing units in the number that 
cannot be expressed by a power of 2 can be collected by the method described above. 
First, using k pieces of integers n, ,n 2 ,n 3 ,"-,n k (where n, >n 2 >n 3 >" - >n k ^ 0), the 
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number of arithmetic processing units of the parallel computer system is expressed 
as: 

[0071] 

[Equation 14] 

a"'+2" 2 +2' i3 + "-+2' i M 

Also, the data array is divided into small arrays in the same number as the number of 
units and divided to each of the arithmetic processing units for calculation processing. 
Among the arithmetic processing units of the parallel computer system, 
[0072] 

[Equation 15] 

2" 2 , T\ 2"* 
units are grouped into groups G,, G 2 ,--, G k , and the arithmetic processing units of the 
parallel computer system are divided into k sets of groups. Similarly, 
[0073] 

[Equation 16] 

E"i , 2"* . 2* , -,2"' I 

pieces of small arrays of the data array are divided into k pieces of arrays as A,, A 2 

A, 

[0074] 

Subsequently, an operation defined in the following double quotations 
(hereinafter referred to as an in-group process p) is executed for all the p' s to 
become 1 S p S k. However, the in-group processes 1, .., k shall be executed in 
parallel. 
[0075] 

"Identification numbers 0, 1, (n p power of 2 -1) are given to (n p power of 2) 
units of the arithmetic processing units in the group G p . Then, an operation q defined 
in the following single quotations is executed for all the q' s to become 0 S q ^ 
p-1 sequentially from q = 0 to q = p-1. 

'The arithmetic processing unit with an identification number N' obtained by 
reversing a numeral on the 2° digit of the identification number N expressed in binary 
is made to correspond to the arithmetic processing unit with the identification number 
N, and calculation processing results of the data array are mutually transmitted / 
received between the arithmetic processing unit with the identification number N and 
the arithmetic processing unit with the identification number N' . However, for q > 0, 
the calculation processing result obtained till the operation (q - 1) is transmitted / 
received in the operation q in addition to the calculation result by each arithmetic 
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processing unit between the arithmetic processing units with the identification 
numbers N and N' . 

Through this operation, the data arrays are collected in (n p power of 2) units 
of the arithmetic processing units in the group G p ." 

According to the group setting method, when the in-group processes 1, k 
are executed in parallel, the in-group process k is firstly finished and then, the 
in-group processes are finished in the order of (k-1), 2, 1. Considering this fact, 
an operation defined in the following parentheses (hereinafter referred to as an 
inter-group process p) is executed for p in the descending order from p = k - 1 to p = 
1. 

[0076] 

(After the in-group process p is finished, data of an array A p collected in each 
arithmetic processing unit of the group G p is transmitted from the arithmetic 
processing units in the group G p to all the arithmetic processing units belonging to the 
groups G p+ ,, G k . That is, in (n p power of 2) units of the arithmetic processing units 
belonging to the group G p , 
[0077] 

[Equation 17] 
\2"p +1 + ... +2"* 

units are selected, and the selected arithmetic processing units are made to 
correspond to the arithmetic processing units belonging to the groups G p+1 , G k in a 
one-to-one manner, and the data of the array A p is transmitted from the group G p to 
the groups G p+1 , G k . Then, data is transmitted from the group G p+1 to the group G p . 
The group G p made up of (n p power of 2) units of the arithmetic processing units is 
divided into small groups a,, a r made up of 
[0078] 

[Equation 18] 

units of the arithmetic processing units. The number r of these small groups is: 
[0079] 

[Equation 19] 

V = 2 Hp+1 

Here, the arithmetic processing units belonging to the group G p+1 are noted as b,, b r . 
The small groups ff,, a r of the group G p are made to correspond to the arithmetic 
processing units b,, b r belonging to the group G p+1 in a one-to-one manner, and an 
operation to transmit data of an array A p+1 collected in the group G p+1 from the 
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arithmetic processing unit b : of the group G p+1 to the single arithmetic processing unit 
a- selected from the small group a. - is executed in parallel for all the i' s to become 1 
i r. At this time, in the case of p < k - 1 , the data arrays A p+2 , A k received 
from the groups G p+2 , G k are included in transmission from the arithmetic 
processing units b : to a-. 
[0080] 

After that, in each of the small groups 01 ; , data transmission is performed 
with the prior-art Binary Tree method from the arithmetic processing unit a- to all the 
arithmetic processing units except a-. As a result, collection of data arrays is 
completed for the data arrays A p , A k for all the arithmetic processing units of the 
group Gp.) With this method, the data arrays which are distributed by a plurality of 
arithmetic processing units to each of the arithmetic processing units and calculated 
in parallel can be efficiently collected in general, and high-speed calculation can be 
realized. 



